String Search

2019-12-13(Fri)
Bio/string.py

Notebook

Example Notebook: Kerasy.examples.string.ipynb

Suffix Array

def kerasy.Bio.string.SAIS(string)

Suffix array is a sorted array of all suffixes of a string, and can be built in $\mathrm{O}(n)$. This procedure is as following:

Define the each character S/L-type.
Find the LMS(Left Most S-type)
Fill in $-1$ to initialize Suffix Array.
Prepare the empty buckets whose sizes are correspond to the Initial character grouping.
Sort the LMS
Scan SA in forward order and fill L-type suffixes.
Scan SA in backward order and fill S-type suffixes.

Step1. Definition of S/L-type

If $i$-th suffix is [smaller, bigger] then $i+1$-th suffix in lexicographical order, $i$-th character is [S,L]-type. Exceptionally, "$" is the initial character and defined S-type.
In the above definition, obviously the following holds:
- $i$-th suffix is S-tyep when
  - $S[i]$ = "$"
  - $S[i] < S[i+1]$
  - $S[i] == S[i+1]$ and $i+1$-th suffix is S-type.
- $i$-th suffix is L-tyep when
  - $S[i] > S[i+1]$
  - $S[i] == S[i+1]$ and $i+1$-th suffix is L-type.

$i=5$	$i=4$	$i=3$

Step2. Definition of LMS

If $S[i]$ is the S-type character and $S[i-1]$ is the L-type, $S[i]$ is called LMS(Left Most S-type).

step2

Step3. Initialize Suffix Array.

step3

Step4. Prepare the empty buckets

step4

Step5. Sort the LMS

※ This sorting is not easy, but ignore for consistency of explanation. (Touch again later.)

step5

Induced Sorting

We induce the order from the already known suffix order.

Step6. Induced sorting for L-type suffixes

Scan SA in forward order and fill L-type suffixes. If $S[\text{Focusing Number}-1]$ is L-type, put $\text{Focusing Number}-1$ into the corresponding bucket from the smaller. This is justified by

Same bucket (Same initial)
L-tyep (Smaller than S-type)
One right is the smallest of the rest.

Step6

Step7. Induced sorting for S-type suffixes

Scan SA in backward order and fill S-type suffixes. At this time, we have to reorder the LMS.

Finally, we get

Step7-4

How to Sort LMS suffix?? (Step5)

We could sort entire SA if LMS is correctly sorted. Then How to sort LMS?? The answer of this question is "Sort without Sorting".

Recursion of SA-IS

Step4'

Step4_prime

Step5'

At this time, we couldn't know the order of LMS, so we "equality" instead of "inequality".

Step5_prime

Step6'

Scan SA in forward order and fill L-type suffixes. At this moment, equality relationship is also passed.

Step6_prime

Step7'

Scan SA in backward order and fill S-type suffixes.

Step7_prime

Encode LMS-substrings

We induce the order from the already known suffix order.

Step5 (Recursion)

Step6 (Recursion)

Step7 (Recursion)

We could sort LMS correctly!!

Computational complexity

In the each recursion step, it is guaranteed that the length of encoded strings is less than half of the previous length.

$$\mathrm{O}(n) + \mathrm{O}(n/2) + \mathrm{O}(n/4) + \cdots = \mathrm{O}(2n)$$

Reference

Two Efficient Algorithms for Linear Time Suffix Array Construction

Burrows-Wheeler Transform; BWT

def kerasy.Bio.string.mkBWT(string, SA)

Explanation	ex. S=TATAA$
BWT(Burrows-Wheeler Transform) is the "characters just to the left of the suffixes in the suffix array", and it can be used to recover the entire string. It is given by $$BWT[i] = \begin{cases}S[SA[i] - 1] & (\text{if $SA[i]>0$})\\\mathrm{\$} & (\text{otherwise})\end{cases}$$

BWT is useful for compression, since it tends to rearrange a character string into runs of similar characters. More importantly, this transformation is reversible, without needing to store any additional data. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.

Reverse BWT with Suffix Array

We will denote $T$ and $ISA$ as followings.

$$ \begin{cases} SA[k] = v &\Longleftrightarrow ISA[v] = k\\ \mathrm{T}[i] &\Longleftrightarrow \mathrm{ISA}\left[\mathrm{SA}[i] - 1\right]\\ \end{cases} $$

This means that the order of suffix whose initial character is $BWT[i]$ is $T[i]$, and under the above definition, the following equation holds.

$$ \begin{aligned} \mathrm{BWT}\left[\mathrm{T}\left[i\right]\right] &= \mathrm{BWT}\left[\mathrm{ISA}\left[\mathrm{SA}\left[i\right] - 1\right]\right]\\ &= \mathrm{BWT}\left[k\ |\ \mathrm{SA}[k] = \mathrm{SA}[i] - 1\right]\\ &= \mathrm{S}\left[SA[k] - 1\ |\ \mathrm{SA}[k] = \mathrm{SA}[i] - 1\right]\\ &= \mathrm{S}\left[\left(SA[i] - 1\right) - 1\right]\\ &= \text{$BWT[i]$'s one left character in $S$} \end{aligned} $$

BWT-without-SA1

Now that we can get the original $S$ from BWT and SA.

Reverse BWT without Suffix Array

In fact, we could recover the original strings only from BWT, because we can make $T$ only from BWT. One of the important properties to realize it is LF Mapping.

LF Mapping

The i-th occurrence of a character x in L and the i-th occurrence of x in F correspond to the same occurrence.

To calculate $T[i]$ only from BWT, we will introduce auxiliary storages $C[x]$, and $Occ(x,i)$

Name	Explanation
$C[x]$	the total occurrence of a characters which is smaller than $x$ in lexicographical order in BWT.
$Occ(x,i)$	the total occurrence of a character $x$ up to $i$-th in the BWT.

Using these variable, we can calculate $T[i]$ by

$$T[i] = \begin{cases} 0 &(\mathrm{BWT}[i] = \text{\$})\\ C\left[\mathrm{BWT}[i]\right] + Occ(\mathrm{BWT}[i], i) - 1 & (\text{otherwise.})\\ \end{cases} $$

The above formula is guaranteed by the following properties.

$$ \begin{cases} [lb(W),ub(W)] &= \text{the index range of words in SA whose prefix is W}\\ lb(\{\}) &= 0\\ ub(\{\}) &= len(S) - 1\\ Occ(x, -1) &= 0\\ lb(xW) &= C(x) + Occ(x, lb(W) - 1) &\cdots (1)\\ ub(xW) &= C(x) + Occ(x, ub(W)) - 1 &\cdots (2) \end{cases} $$

These properties (especially the recursion (1) and (2)) are proved by

$(1)$	$(2)$

This allows us to search the occurrence range of characters with a specific prefix.

Reference

Burrows–Wheeler transform and palindromic richness

Longest Common Prefix; LCP

def kerasy.Bio.string.mkHeight(string, SA)

LCP(Longest Common Prefix) is "how many characters two strings have in common with each other", and LCP array is an array where each adjacent index stores lcp between $i$-th suffix and $i+1$-th suffix.

Speaking of LCParray, it usually refers to the suffix array,

Reference

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
(If you read this paper, please look this. It will help to read.)

Name	Explanation
\(C[x]\)	the total occurrence of a characters which is smaller than \(x\) in lexicographical order in BWT.
\(Occ(x,i)\)	the total occurrence of a character \(x\) up to \(i\)-th in the BWT.